75 research outputs found
GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to
detect translation quality errors, specifically for the quality estimation
setting without the need for human reference translations. Building on the power
of large language models (LLMs), GEMBA-MQM employs a fixed three-shot prompting
technique, querying the GPT-4 model to mark error quality spans. Compared to
previous works, our method has language-agnostic prompts, thus avoiding the
need for manual prompt preparation for new languages.
While preliminary results indicate that GEMBA-MQM achieves state-of-the-art
accuracy for system ranking, we advise caution when using it in academic works
to demonstrate improvements over other methods due to its dependence on the
proprietary, black-box GPT model.
Comment: Accepted to WMT 202
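The scheme the abstract describes — a fixed few-shot prompt asking an LLM to mark MQM error spans, which are then weighted into a score — can be sketched as follows. The prompt template, label set, and severity weights here are illustrative assumptions, not the paper's released prompt, and no API call is made:

```python
# Illustrative sketch of a GEMBA-MQM-style query. Template, labels, and
# weights are assumptions; only the overall scheme (fixed few-shot prompt ->
# LLM marks MQM error spans -> spans weighted into a score) follows the text.

SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}  # assumed MQM-style weights

def build_prompt(src_lang, tgt_lang, source, translation, shots):
    """Assemble a language-agnostic few-shot prompt asking for error spans.
    shots: list of (source, translation, errors) demonstration triples."""
    parts = [f"Identify translation quality errors ({src_lang} -> {tgt_lang}).",
             "List each error as '<severity>: <span>'.", ""]
    for shot_src, shot_tgt, shot_err in shots:
        parts += [f"Source: {shot_src}", f"Translation: {shot_tgt}",
                  f"Errors: {shot_err}", ""]
    parts += [f"Source: {source}", f"Translation: {translation}", "Errors:"]
    return "\n".join(parts)

def score_from_response(response):
    """Turn 'severity: span' lines from the model into a penalty score."""
    penalty = 0
    for line in response.strip().splitlines():
        severity = line.split(":", 1)[0].strip().lower()
        penalty += SEVERITY_WEIGHTS.get(severity, 0)
    return -penalty  # closer to 0 is better

# Canned model response standing in for a GPT-4 reply:
print(score_from_response("major: fire brigade\nminor: extra comma"))  # -> -6
```

Because the prompt mentions no language-specific instructions beyond the language names, the same template works for new language pairs, which is the language-agnostic property the abstract claims.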
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
We describe GEMBA, a GPT-based metric for assessment of translation quality,
which works both with a reference translation and without. In our evaluation,
we focus on zero-shot prompting, comparing four prompt variants in two modes,
based on the availability of the reference. We investigate nine versions of GPT
models, including ChatGPT and GPT-4. We show that our method for translation
quality assessment only works with GPT-3.5 and larger models. Compared to
results from WMT22's Metrics shared task, our method achieves state-of-the-art
accuracy in both modes when compared to MQM-based human labels. Our results are
valid on the system level for all three WMT22 Metrics shared task language
pairs, namely English into German, English into Russian, and Chinese into
English. This provides a first glimpse into the usefulness of pre-trained,
generative large language models for quality assessment of translations. We
publicly release all our code and prompt templates used for the experiments
described in this work, as well as all corresponding scoring results, to allow
for external validation and reproducibility.
Comment: Accepted in EAMT, 10 pages, 8 tables, one figure
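The two prompting modes the abstract compares — with and without a reference translation — differ only in whether the reference is included in the scoring prompt. A minimal sketch (the wording below is a simplified stand-in, not one of the released templates):

```python
# Sketch of the two GEMBA prompting modes. The paper releases its actual
# templates; this simplified wording only illustrates the reference /
# no-reference distinction in a zero-shot, direct-assessment style prompt.

def gemba_prompt(src_lang, tgt_lang, source, translation, reference=None):
    """Zero-shot scoring prompt; reference mode is optional."""
    ref = f'Reference: "{reference}". ' if reference is not None else ""
    return (f"Score the following translation from {src_lang} to {tgt_lang} "
            f"on a continuous scale from 0 to 100. {ref}"
            f'Source: "{source}". Translation: "{translation}". Score:')

with_ref = gemba_prompt("English", "German", "Good morning.", "Guten Morgen.",
                        reference="Guten Morgen.")
no_ref = gemba_prompt("English", "German", "Good morning.", "Guten Morgen.")
print("Reference" in with_ref, "Reference" in no_ref)  # -> True False
```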
Hybrid machine translation using binary classification models trained on joint, binarised feature vectors
We describe the design and implementation of a system combination method for machine translation output. It is based on sentence selection using binary classification models estimated on joint, binarised feature vectors. In contrast to existing system combination methods, which work by dividing candidate translations into n-grams, i.e., sequences of n words or tokens, our framework performs sentence selection, which does not alter the selected, best translation. First, we investigate the potential performance gain attainable by optimal sentence selection. To do so, we conduct the largest meta-study on data released by the yearly Workshop on Statistical Machine Translation (WMT). Second, we introduce so-called joint, binarised feature vectors which explicitly model feature value comparison for two systems A, B. We compare different settings for training binary classifiers using single, joint, as well as joint, binarised feature vectors. After having shown the potential of both selection and binarisation as methodological paradigms, we combine these two into a framework which applies pairwise comparison of all candidate systems to determine the best translation for each individual sentence. Our experiments confirm that our system outperforms other state-of-the-art system combination approaches. We conclude by summarising the main findings and contributions of our thesis and by giving an outlook on future research directions.
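The two central ideas — joint, binarised feature vectors and pairwise sentence selection — can be sketched as below. The binarisation rule (an indicator bit per feature marking whether system A's value exceeds system B's) and the example features are our reading of the abstract, not the thesis's exact formulation:

```python
# Sketch of joint, binarised feature vectors and pairwise sentence selection,
# assuming each system provides a per-sentence feature vector (e.g. LM score,
# length ratio). The binarisation rule is an assumption based on the abstract.

def joint_binarised(feats_a, feats_b):
    """Joint vector: both systems' raw features plus comparison bits."""
    bits = [1.0 if a > b else 0.0 for a, b in zip(feats_a, feats_b)]
    return feats_a + feats_b + bits

def select_best(candidates, prefer_a):
    """Round-robin pairwise comparison over all candidate systems.
    candidates: list of (name, features); prefer_a(fa, fb) -> True if A wins."""
    wins = {name: 0 for name, _ in candidates}
    for i, (na, fa) in enumerate(candidates):
        for nb, fb in candidates[i + 1:]:
            wins[na if prefer_a(fa, fb) else nb] += 1
    return max(wins, key=wins.get)

# Toy stand-in for the trained binary classifier:
prefer = lambda fa, fb: sum(fa) > sum(fb)
systems = [("sysA", [0.2, 0.9]), ("sysB", [0.5, 0.1]), ("sysC", [0.4, 0.4])]
print(select_best(systems, prefer))  # -> sysA
```

Because the winner of each pairwise comparison is an unmodified candidate sentence, the selected output is always a translation that one of the systems actually produced, which is the non-modifying property the abstract emphasises.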
Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation
We describe a substitution-based, hybrid machine translation (MT) system that has been extended with a machine learning component controlling its phrase selection. Our approach is based on a rule-based MT (RBMT) system which creates template translations. Based on the generation parse tree of the RBMT system and standard word alignment computation, we identify potential "translation snippets" from one or more translation engines which could be substituted into our translation templates. The substitution process is controlled by a binary classifier trained on feature vectors from the different MT engines. Using a set of manually annotated training data, we are able to observe improvements in terms of BLEU scores over a baseline version of the hybrid system.
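The classifier-controlled substitution step can be sketched as a toy example. Names and the decision threshold are illustrative; the actual system locates substitution points via RBMT parse trees and word alignments, which are elided here:

```python
# Toy sketch of classifier-controlled snippet substitution into a template
# translation. The real system derives substitution points from the RBMT
# generation parse tree and word alignments; here positions are given directly.

def substitute(template_tokens, snippets, accept):
    """snippets: {position: (candidate_phrase, feature_vector)};
    accept: trained binary classifier deciding whether to substitute."""
    out = list(template_tokens)
    for pos, (phrase, feats) in snippets.items():
        if accept(feats):          # classifier approves this snippet
            out[pos] = phrase
    return out

accept = lambda feats: feats[0] > 0.5   # stand-in for the trained model
template = ["das", "Haus", "ist", "gross"]
snippets = {3: ("riesig", [0.8]), 1: ("Gebaeude", [0.2])}
print(substitute(template, snippets, accept))  # -> ['das', 'Haus', 'ist', 'riesig']
```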
Results from the ML4HMT-12 shared task on applying machine learning techniques to optimise the division of labour in hybrid machine translation
We describe the second edition of the ML4HMT shared task, which challenges participants to create hybrid translations from the translation output of several individual MT systems. We provide an overview of the shared task and the data made available to participants before briefly describing the individual systems. We report on the results using automatic evaluation metrics and conclude with a summary of ML4HMT-12 and an outlook on future work.
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English–Telugu
Telugu is the fifteenth most commonly spoken language in the world, with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low-resourced language. In this paper, we present work on English–Telugu general domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back-translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back-translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost the translation quality of our final NMT systems, as measured by BLEU scores on all test sets and based on state-of-the-art human evaluation.
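Read schematically, the iterative loop alternates back-translation, filtering, and retraining. The sketch below is our reading of the abstract with every training, translation, and filtering step replaced by a placeholder function:

```python
# Schematic of an iterative data augmentation (IDA) loop as we read the
# abstract: train a target->source model, back-translate monolingual target
# text, filter the synthetic pairs, add them to the parallel data, repeat.
# train / back_translate / keep are placeholders for the real components.

def ida(parallel, mono_tgt, rounds, train, back_translate, keep):
    data = list(parallel)                 # (source, target) pairs
    for _ in range(rounds):
        reverse_model = train([(t, s) for s, t in data])   # target -> source
        synthetic = [(back_translate(reverse_model, t), t) for t in mono_tgt]
        data += [pair for pair in synthetic if keep(pair)]  # filter noisy pairs
    return train(data)                    # final source -> target system

# Stub components, just to check the wiring (the "model" is the data size):
train = lambda pairs: len(pairs)
bt = lambda model, t: "src?" + t
keep = lambda pair: True
print(ida([("hello", "halo")], ["one", "two"], rounds=1,
          train=train, back_translate=bt, keep=keep))  # -> 3
```

Each round can use the improved model from the previous round to produce better synthetic sources, which is what distinguishes the iterative scheme from one-shot back-translation.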
Tumor Heterogeneity in Lymphomas: A Different Breed.
It has long been recognized that cancers represent tissues consisting of heterogeneous neoplastic, as well as reactive, cell populations and that cancers of the same histotype may show profound differences in clinical behavior. With the advent of new technologies and the demands of precision medicine, the investigation of tumor heterogeneity has gained much interest. An understanding of intertumoral heterogeneity in patients with the same disease entity is necessary to optimally guide personalized treatment. In addition, increasing evidence indicates that different tumor areas or primary tumors and metastases in an individual patient can show significant intratumoral heterogeneity on different levels. This phenomenon can be driven by genomic instability, epigenetic events, the tumor microenvironment, and stochastic variations in cellular function and antitumoral therapies. These mechanisms may lead to branched subclonal evolution from a common progenitor clone, resulting in spatial variation between different tumor sites, disease progression, and treatment resistance. This review addresses tumor heterogeneity in lymphomas from a pathologist's viewpoint. The relationship between morphologic, immunophenotypic, and genetic heterogeneity is exemplified in different lymphoma entities and reviewed in the context of high-grade transformation and transdifferentiation. In addition, factors driving heterogeneity, as well as clinical and therapeutic implications of lymphoma heterogeneity, will be discussed.
Towards Automatic Face-to-Face Translation
In light of the recent breakthroughs in automatic machine translation
systems, we propose a novel approach that we term as "Face-to-Face
Translation". As today's digital communication becomes increasingly visual, we
argue that there is a need for systems that can automatically translate a video
of a person speaking in language A into a target language B with realistic lip
synchronization. In this work, we create an automatic pipeline for this problem
and demonstrate its impact on multiple real-world applications. First, we build
a working speech-to-speech translation system by bringing together multiple
existing modules from speech and language. We then move towards "Face-to-Face
Translation" by incorporating a novel visual module, LipGAN for generating
realistic talking faces from the translated audio. Quantitative evaluation of
LipGAN on the standard LRW test set shows that it significantly outperforms
existing approaches across all standard metrics. We also subject our
Face-to-Face Translation pipeline to multiple human evaluations and show that
it can significantly improve the overall user experience for consuming and
interacting with multimodal content across languages. Code, models and demo
video are made publicly available.
Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0
Code and models: https://github.com/Rudrabha/LipGAN
Comment: 9 pages (including references), 5 figures, Published in ACM Multimedia, 201
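The pipeline the abstract describes composes four stages: speech recognition, text translation, speech synthesis, and LipGAN's lip-synced rendering. A minimal sketch of that composition, with all four stage functions as placeholders for the existing modules the paper brings together:

```python
# Rough sketch of the face-to-face translation pipeline: ASR -> MT -> TTS ->
# LipGAN rendering. All four stages are placeholders; only the composition
# order follows the abstract.

def face_to_face(video, asr, translate, tts, lipgan):
    text_a = asr(video)             # speech in language A -> text A
    text_b = translate(text_a)      # text A -> text B
    audio_b = tts(text_b)           # text B -> speech in language B
    return lipgan(video, audio_b)   # re-render the face, lips synced to B

# Stub stages, just to check the wiring:
out = face_to_face("clip.mp4",
                   asr=lambda v: "good morning",
                   translate=lambda t: "guten morgen",
                   tts=lambda t: "audio:" + t,
                   lipgan=lambda v, a: (v, a))
print(out)  # -> ('clip.mp4', 'audio:guten morgen')
```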
Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment
In this paper we present an analysis of the two most prominent methodologies used for the human evaluation of MT quality, namely evaluation based on Post-Editing (PE) and evaluation based on Direct Assessment (DA). To this purpose, we exploit a publicly available large dataset containing both types of evaluations. We first focus on PE and investigate how sensitive TER-based evaluation is to the type and number of references used. Then, we carry out a comparative analysis of PE and DA to investigate the extent to which the evaluation results obtained by methodologies addressing different human perspectives are similar. This comparison sheds light not only on PE but also on the so-called reference bias related to monolingual DA. Also, we analyze whether and how the two methodologies can complement each other's weaknesses.
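TER counts the minimum number of word edits needed to turn the MT output into a reference, normalised by reference length. The sketch below omits TER's phrase-shift operation, so it is really a WER-style approximation; it only illustrates why the score depends on which and how many references are used:

```python
# WER-style approximation of TER (phrase shifts omitted), to illustrate the
# reference sensitivity the abstract investigates.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev_row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        row = [i]
        for j, r in enumerate(ref, 1):
            row.append(min(prev_row[j] + 1,               # deletion
                           row[j - 1] + 1,                # insertion
                           prev_row[j - 1] + (h != r)))   # substitution
        prev_row = row
    return prev_row[len(ref)]

def ter_like(hyp, refs):
    """Score against the closest reference, normalised by its length."""
    return min(edit_distance(hyp.split(), r.split()) / len(r.split())
               for r in refs)

refs = ["the cat sat on the mat", "a cat sat"]
print(round(ter_like("the cat sat", refs), 3))  # -> 0.333
```

With only the first reference, the same hypothesis would score 0.5; adding the second, closer reference halves the penalty, which is the kind of sensitivity a TER-based PE evaluation must account for.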